
import os
import sys
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as npr
import pandas as pd
from sklearn.compose import (
ColumnTransformer,
TransformedTargetRegressor,
make_column_transformer,
)
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.svm import SVC
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
From this lecture, students are expected to be able to:
sklearn's implementation of model-based selection and recursive feature elimination (RFE)
Select the most accurate option below.
Suppose you are working on a machine learning project. If you have to prioritize one of the following in your project which of the following would it be?
Discussion question
Upon analyzing the data, you notice a pattern: flights tend to be delayed more often during the evening rush hours. What feature could be valuable to add for this prediction task?
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
- Jason Brownlee
A quote by Pedro Domingos A Few Useful Things to Know About Machine Learning
... At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.
A quote by Andrew Ng, Machine Learning and AI via Brain simulations
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
In this lecture, I'll show you an example of feature engineering on text data.
Is the following dataset (XOR function) linearly separable?
| $x_1$ | $x_2$ | target |
|---|---|---|
| 1 | 1 | 0 |
| -1 | 1 | 1 |
| 1 | -1 | 1 |
| -1 | -1 | 0 |
import seaborn as sb
X = np.array([
[-1, -1],
[1, -1],
[-1, 1],
[1, 1]
])
y = np.array([1, 0, 0, 1])
df = pd.DataFrame(np.column_stack([X, y]), columns=["X1", "X2", "target"])
plt.figure(figsize=(4, 4))
sb.scatterplot(data=df, x="X1", y="X2", style="target", s=200, legend=False);
| $x_1$ | $x_2$ | $x_1x_2$ | target |
|---|---|---|---|
| 1 | 1 | 1 | 0 |
| -1 | 1 | -1 | 1 |
| 1 | -1 | -1 | 1 |
| -1 | -1 | 1 | 0 |
df["X1X2"] = df["X1"] * df["X2"]
df
| | X1 | X2 | target | X1X2 |
|---|---|---|---|---|
| 0 | -1 | -1 | 1 | 1 |
| 1 | 1 | -1 | 0 | -1 |
| 2 | -1 | 1 | 0 | -1 |
| 3 | 1 | 1 | 1 | 1 |
plt.figure(figsize=(4, 4))
sb.scatterplot(data=df, x="X2", y="X1X2", style="target", s=200, legend=False);
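As a quick sanity check of the idea, the sketch below uses the four points and labels from the XOR table with the $x_1x_2$ column and shows that a single threshold on the product feature classifies all of them correctly. The threshold rule (`product < 0`) is an illustrative assumption, not something a model has learned here.

```python
import numpy as np

# The four XOR points (x1, x2) and their targets, as listed in the table.
X = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]])
y = np.array([0, 1, 1, 0])

# Engineered interaction feature: the product of the two inputs.
x1x2 = X[:, 0] * X[:, 1]

# Hand-picked rule (an assumption for illustration):
# predict class 1 when the product is negative, class 0 otherwise.
y_pred = (x1x2 < 0).astype(int)

print("x1*x2:      ", x1x2)
print("predictions:", y_pred)
print("all correct:", bool((y_pred == y).all()))
```

No single threshold on $x_1$ or $x_2$ alone achieves this, which is why the interaction feature helps a linear model.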
Let's look at an example with more data points.
xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
rng = np.random.RandomState(0)
rng.randn(4, 2) # example output
array([[ 1.76405235, 0.40015721],
[ 0.97873798, 2.2408932 ],
[ 1.86755799, -0.97727788],
[ 0.95008842, -0.15135721]])
X_xor = rng.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
# Interaction term
Z = X_xor[:, 0] * X_xor[:, 1]
df = pd.DataFrame({'X': X_xor[:, 0], 'Y': X_xor[:, 1], 'Z': Z, 'Class': y_xor})
df.head()
| | X | Y | Z | Class |
|---|---|---|---|---|
| 0 | -0.103219 | 0.410599 | -0.042382 | True |
| 1 | 0.144044 | 1.454274 | 0.209479 | False |
| 2 | 0.761038 | 0.121675 | 0.092599 | False |
| 3 | 0.443863 | 0.333674 | 0.148106 | False |
| 4 | 1.494079 | -0.205158 | -0.306523 | True |
# Label the two classes distinctly (True -> Class 1, False -> Class 0) and show the legend
plt.scatter(df[df['Class'] == True]['X'], df[df['Class'] == True]['Y'], c='blue', label='Class 1', s=50)
plt.scatter(df[df['Class'] == False]['X'], df[df['Class'] == False]['Y'], c='red', label='Class 0', s=50)
plt.legend();
# Create an interactive 3D scatter plot using plotly
import plotly.express as px
fig = px.scatter_3d(df, x='X', y='Y', z='Z', color='Class', color_continuous_scale=['blue', 'red'])
fig.show()
LogisticRegression().fit(X_xor, y_xor).score(X_xor, y_xor)
0.6
from sklearn.preprocessing import PolynomialFeatures
pipe_xor = make_pipeline(
PolynomialFeatures(interaction_only=True, include_bias=False), LogisticRegression()
)
pipe_xor.fit(X_xor, y_xor)
pipe_xor.score(X_xor, y_xor)
0.985
feature_names = (
pipe_xor.named_steps["polynomialfeatures"].get_feature_names_out().tolist()
)
# transformed = pipe_xor.named_steps["polynomialfeatures"].transform(X_xor)
pd.DataFrame(
pipe_xor.named_steps["logisticregression"].coef_.transpose(),
index=feature_names,
columns=["Feature coefficient"],
)
| | Feature coefficient |
|---|---|
| x0 | -0.101041 |
| x1 | 0.134703 |
| x0 x1 | -5.109696 |
The interaction feature has by far the largest coefficient in magnitude!
The task is to predict median_house_value for a given property.
housing_df = pd.read_csv("../data/california_housing.csv")
housing_df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
housing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Suppose we decide to train a ridge model on this dataset.
ocean_proximity?
ocean_proximity but we do not scale the features?
In this section, we will look into some common ways to do feature engineering for numeric or categorical features.
train_df, test_df = train_test_split(housing_df, test_size=0.2, random_state=123)
We have the total number of rooms and the number of households in the neighbourhood. How about creating a rooms_per_household feature using this information?
train_df = train_df.assign(
rooms_per_household=train_df["total_rooms"] / train_df["households"]
)
test_df = test_df.assign(
rooms_per_household=test_df["total_rooms"] / test_df["households"]
)
train_df
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | rooms_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9950 | -122.33 | 38.38 | 28.0 | 1020.0 | 169.0 | 504.0 | 164.0 | 4.5694 | 287500.0 | INLAND | 6.219512 |
| 3547 | -118.60 | 34.26 | 18.0 | 6154.0 | 1070.0 | 3010.0 | 1034.0 | 5.6392 | 271500.0 | <1H OCEAN | 5.951644 |
| 4448 | -118.21 | 34.07 | 47.0 | 1346.0 | 383.0 | 1452.0 | 371.0 | 1.7292 | 191700.0 | <1H OCEAN | 3.628032 |
| 6984 | -118.02 | 33.96 | 36.0 | 2071.0 | 398.0 | 988.0 | 404.0 | 4.6226 | 219700.0 | <1H OCEAN | 5.126238 |
| 4432 | -118.20 | 34.08 | 49.0 | 1320.0 | 309.0 | 1405.0 | 328.0 | 2.4375 | 114000.0 | <1H OCEAN | 4.024390 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7763 | -118.10 | 33.91 | 36.0 | 726.0 | NaN | 490.0 | 130.0 | 3.6389 | 167600.0 | <1H OCEAN | 5.584615 |
| 15377 | -117.24 | 33.37 | 14.0 | 4687.0 | 793.0 | 2436.0 | 779.0 | 4.5391 | 180900.0 | <1H OCEAN | 6.016688 |
| 17730 | -121.76 | 37.33 | 5.0 | 4153.0 | 719.0 | 2435.0 | 697.0 | 5.6306 | 286200.0 | <1H OCEAN | 5.958393 |
| 15725 | -122.44 | 37.78 | 44.0 | 1545.0 | 334.0 | 561.0 | 326.0 | 3.8750 | 412500.0 | NEAR BAY | 4.739264 |
| 19966 | -119.08 | 36.21 | 20.0 | 1911.0 | 389.0 | 1241.0 | 348.0 | 2.5156 | 59300.0 | INLAND | 5.491379 |
16512 rows × 11 columns
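The same pattern extends to other ratio features. The sketch below rebuilds `rooms_per_household` on a two-row toy frame (values copied from the first two rows of `housing_df.head()`) and adds two further ratios. The names `bedrooms_per_household` and `population_per_household` are hypothetical additions for illustration and are not used elsewhere in this lecture.

```python
import pandas as pd

# Toy frame with values taken from the first two rows shown by housing_df.head().
toy = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0],
        "total_bedrooms": [129.0, 1106.0],
        "population": [322.0, 2401.0],
        "households": [126.0, 1138.0],
    }
)

# Each ratio is computed row by row, so no statistics are learned from the data
# and applying the same formula to train and test separately is safe.
toy = toy.assign(
    rooms_per_household=toy["total_rooms"] / toy["households"],
    bedrooms_per_household=toy["total_bedrooms"] / toy["households"],  # hypothetical
    population_per_household=toy["population"] / toy["households"],  # hypothetical
)
print(toy.round(3))
```

Whether such extra ratios help is an empirical question; they would need to be checked with cross-validation like any other feature.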
Let's start simple. Imagine that we only have three features: longitude, latitude, and our newly created rooms_per_household feature.
X_train_housing = train_df[["latitude", "longitude", "rooms_per_household"]]
y_train_housing = train_df["median_house_value"]
from sklearn.compose import make_column_transformer
numeric_feats = ["latitude", "longitude", "rooms_per_household"]
preprocessor1 = make_column_transformer(
(make_pipeline(SimpleImputer(), StandardScaler()), numeric_feats)
)
lr_1 = make_pipeline(preprocessor1, Ridge())
pd.DataFrame(
cross_validate(lr_1, X_train_housing, y_train_housing, return_train_score=True)
)
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.015996 | 0.151829 | 0.280028 | 0.311769 |
| 1 | 0.048619 | 0.018473 | 0.325319 | 0.300464 |
| 2 | 0.037484 | 0.018449 | 0.317277 | 0.301952 |
| 3 | 0.027465 | 0.015590 | 0.316798 | 0.303004 |
| 4 | 0.037017 | 0.033977 | 0.260258 | 0.314840 |
plt.figure(figsize=(6, 4), dpi=80)
plt.hist(train_df["longitude"], bins=50)
plt.title("Distribution of longitude feature");
plt.figure(figsize=(6, 4), dpi=80)
plt.hist(train_df["latitude"], bins=50)
plt.title("Distribution of latitude feature");
In sklearn, you can discretize (bin) a numeric feature using the KBinsDiscretizer transformer.
from sklearn.preprocessing import KBinsDiscretizer
discretization_feats = ["latitude", "longitude"]
numeric_feats = ["rooms_per_household"]
preprocessor2 = make_column_transformer(
(KBinsDiscretizer(n_bins=20, encode="onehot"), discretization_feats),
(make_pipeline(SimpleImputer(), StandardScaler()), numeric_feats),
)
lr_2 = make_pipeline(preprocessor2, Ridge())
pd.DataFrame(
cross_validate(lr_2, X_train_housing, y_train_housing, return_train_score=True)
)
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.172392 | 0.029055 | 0.441445 | 0.456419 |
| 1 | 0.149986 | 0.025023 | 0.469571 | 0.446216 |
| 2 | 0.177593 | 0.029383 | 0.479132 | 0.446869 |
| 3 | 0.156570 | 0.034505 | 0.450822 | 0.453367 |
| 4 | 0.157785 | 0.027586 | 0.388169 | 0.467628 |
The results are better with binned features. Let's examine what these binned features look like.
lr_2.fit(X_train_housing, y_train_housing)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('kbinsdiscretizer',
KBinsDiscretizer(n_bins=20),
['latitude', 'longitude']),
('pipeline',
Pipeline(steps=[('simpleimputer',
SimpleImputer()),
('standardscaler',
StandardScaler())]),
['rooms_per_household'])])),
('ridge', Ridge())])
pd.DataFrame(
preprocessor2.fit_transform(X_train_housing).todense(),
columns=preprocessor2.get_feature_names_out(),
)
| | kbinsdiscretizer__latitude_0.0 | kbinsdiscretizer__latitude_1.0 | kbinsdiscretizer__latitude_2.0 | kbinsdiscretizer__latitude_3.0 | kbinsdiscretizer__latitude_4.0 | kbinsdiscretizer__latitude_5.0 | kbinsdiscretizer__latitude_6.0 | kbinsdiscretizer__latitude_7.0 | kbinsdiscretizer__latitude_8.0 | kbinsdiscretizer__latitude_9.0 | ... | kbinsdiscretizer__longitude_11.0 | kbinsdiscretizer__longitude_12.0 | kbinsdiscretizer__longitude_13.0 | kbinsdiscretizer__longitude_14.0 | kbinsdiscretizer__longitude_15.0 | kbinsdiscretizer__longitude_16.0 | kbinsdiscretizer__longitude_17.0 | kbinsdiscretizer__longitude_18.0 | kbinsdiscretizer__longitude_19.0 | pipeline__rooms_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.316164 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.209903 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.711852 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.117528 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.554621 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16507 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.064307 |
| 16508 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.235706 |
| 16509 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.212581 |
| 16510 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.271037 |
| 16511 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.027321 |
16512 rows × 41 columns
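With 41 columns the one-hot output is hard to read, so here is a minimal sketch of the same transformer on six made-up values. It assumes `strategy="uniform"` (equal-width bins) so the bin edges are easy to predict by hand; the pipelines above use the default strategy instead.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Six hypothetical values of a single numeric feature.
values = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# encode="ordinal" returns the bin index of each value.
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bin_ids = kbd.fit_transform(values).ravel()
print("bin edges:", kbd.bin_edges_[0])
print("bin ids:  ", bin_ids)

# encode="onehot-dense" returns one indicator column per bin,
# which is what the linear model sees as separate features.
onehot = KBinsDiscretizer(
    n_bins=3, encode="onehot-dense", strategy="uniform"
).fit_transform(values)
print(onehot)
```

Because each bin gets its own coefficient, a linear model on the one-hot columns can fit a step function of the original feature rather than a single straight line.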
How about discretizing all three features?
from sklearn.preprocessing import KBinsDiscretizer
discretization_feats = ["latitude", "longitude", "rooms_per_household"]
preprocessor3 = make_column_transformer(
(KBinsDiscretizer(n_bins=20, encode="onehot"), discretization_feats),
)
lr_3 = make_pipeline(preprocessor3, Ridge())
pd.DataFrame(
cross_validate(lr_3, X_train_housing, y_train_housing, return_train_score=True)
)
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.176174 | 0.027912 | 0.590618 | 0.571969 |
| 1 | 0.232690 | 0.030871 | 0.575907 | 0.570473 |
| 2 | 0.102713 | 0.017999 | 0.579091 | 0.573542 |
| 3 | 0.130407 | 0.023641 | 0.571500 | 0.574260 |
| 4 | 0.153937 | 0.019982 | 0.541488 | 0.581687 |
lr_3.fit(X_train_housing, y_train_housing)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('kbinsdiscretizer',
KBinsDiscretizer(n_bins=20),
['latitude', 'longitude',
'rooms_per_household'])])),
('ridge', Ridge())])
feature_names = (
lr_3.named_steps["columntransformer"]
.named_transformers_["kbinsdiscretizer"]
.get_feature_names_out()
)
lr_3.named_steps["ridge"].coef_.shape
(60,)
coefs_df = pd.DataFrame(
lr_3.named_steps["ridge"].coef_.transpose(),
index=feature_names,
columns=["coefficient"],
).sort_values("coefficient", ascending=False)
coefs_df.head(10)
| | coefficient |
|---|---|
| longitude_1.0 | 211343.036136 |
| latitude_1.0 | 205059.296601 |
| latitude_0.0 | 201862.534342 |
| longitude_0.0 | 190319.721818 |
| longitude_2.0 | 160282.191204 |
| longitude_3.0 | 157234.920305 |
| latitude_2.0 | 154105.963689 |
| rooms_per_household_19.0 | 138503.477291 |
| latitude_8.0 | 135299.516394 |
| longitude_4.0 | 132292.924485 |
We will be using a Covid tweets dataset for this.
df = pd.read_csv('../data/Corona_NLP_test.csv')
df['Sentiment'].value_counts()
Sentiment
Negative              1041
Positive               947
Neutral                619
Extremely Positive     599
Extremely Negative     592
Name: count, dtype: int64
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 1927 | 1928 | 46880 | Seattle, WA | 13-03-2020 | While I don't like all of Amazon's choices, to... | Positive |
| 1068 | 1069 | 46021 | NaN | 13-03-2020 | Me: shit buckets, its time to do the weekly s... | Negative |
| 803 | 804 | 45756 | The Outer Limits | 12-03-2020 | @SecPompeo @realDonaldTrump You mean the plan ... | Neutral |
| 2846 | 2847 | 47799 | Flagstaff, AZ | 15-03-2020 | @lauvagrande People who are sick arent panic ... | Extremely Negative |
| 3768 | 3769 | 48721 | Montreal, Canada | 16-03-2020 | Coronavirus Panic: Toilet Paper Is the People... | Negative |
| ... | ... | ... | ... | ... | ... | ... |
| 1122 | 1123 | 46075 | NaN | 13-03-2020 | Photos of our local grocery store shelveswher... | Extremely Positive |
| 1346 | 1347 | 46299 | Toronto | 13-03-2020 | Just went to the the grocery store (Highland F... | Positive |
| 3454 | 3455 | 48407 | Houston, TX | 16-03-2020 | Real talk though. Am I the only one spending h... | Neutral |
| 3437 | 3438 | 48390 | Washington, DC | 16-03-2020 | The supermarket business is booming! #COVID2019 | Neutral |
| 3582 | 3583 | 48535 | St James' Park, Newcastle | 16-03-2020 | Evening All Here s the story on the and the im... | Positive |
3038 rows × 6 columns
train_df.columns
Index(['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet',
'Sentiment'],
dtype='object')
train_df['Location'].value_counts()
Location
United States 63
London, England 37
Los Angeles, CA 30
New York, NY 29
Washington, DC 29
..
Suburb of Chicago 1
philippines 1
Dont ask for freedom, take it. 1
Windsor Heights, IA 1
St James' Park, Newcastle 1
Name: count, Length: 1441, dtype: int64
X_train, y_train = train_df[['OriginalTweet', 'Location']], train_df['Sentiment']
X_test, y_test = test_df[['OriginalTweet', 'Location']], test_df['Sentiment']
y_train.value_counts()
Sentiment
Negative              852
Positive              743
Neutral               501
Extremely Negative    472
Extremely Positive    470
Name: count, dtype: int64
scoring_metrics = 'accuracy'
results = {}
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data
    **kwargs :
        additional keyword arguments passed on to cross_validate

    Returns
    ----------
        pandas Series with mean scores (and standard deviations) from cross_validation
    """
    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    # Format each metric as "mean (+/- std)"
    for i in range(len(mean_scores)):
        out_col.append("%0.3f (+/- %0.3f)" % (mean_scores.iloc[i], std_scores.iloc[i]))

    return pd.Series(data=out_col, index=mean_scores.index)
dummy = DummyClassifier()
results["dummy"] = mean_std_cross_val_scores(
dummy, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| dummy | 0.004 (+/- 0.001) | 0.003 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
from sklearn.feature_extraction.text import CountVectorizer
pipe = make_pipeline(CountVectorizer(stop_words='english'),
LogisticRegression(max_iter=1000))
results["logistic regression"] = mean_std_cross_val_scores(
pipe, X_train['OriginalTweet'], y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| dummy | 0.004 (+/- 0.001) | 0.003 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
| logistic regression | 4.959 (+/- 1.636) | 0.098 (+/- 0.011) | 0.413 (+/- 0.011) | 0.999 (+/- 0.000) |
How about adding new features based on our intuitions? Let's extract our own features that might be useful for this prediction task. In other words, let's carry out feature engineering.
The code below adds some very basic length-related and sentiment features. We will be using a popular library called nltk for this exercise. If you have successfully created the course conda environment on your machine, you should already have this package in the environment.
nltk
conda install -n cpsc330 -c anaconda nltk
conda install -n cpsc330 -c conda-forge spacy
For emoji support:
pip install spacymoji
conda environment or here.
import spacy
# !python -m spacy download en_core_web_md
import nltk
nltk.download("punkt")
[nltk_data] Downloading package punkt to /home/mehrdad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/mehrdad/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/mehrdad/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
s = "CPSC 330 students are smart and funny."
print(sid.polarity_scores(s))
{'neg': 0.0, 'neu': 0.472, 'pos': 0.528, 'compound': 0.6808}
s = "CPSC 330 students are tired because of all the hard work they have been doing."
print(sid.polarity_scores(s))
{'neg': 0.249, 'neu': 0.751, 'pos': 0.0, 'compound': -0.5106}
A useful package for text processing and feature extraction
import en_core_web_md # pre-trained model
import spacy
nlp = en_core_web_md.load()
sample_text = """Dolly Parton is a gift to us all.
From writing all-time great songs like “Jolene” and “I Will Always Love You”,
to great performances in films like 9 to 5, to helping fund a COVID-19 vaccine,
she’s given us so much. Now, Netflix bring us Dolly Parton’s Christmas on the Square,
an original musical that stars Christine Baranski as a Scrooge-like landowner
who threatens to evict an entire town on Christmas Eve to make room for a new mall.
Directed and choreographed by the legendary Debbie Allen and counting Jennifer Lewis
and Parton herself amongst its cast, Christmas on the Square seems like the perfect movie
to save Christmas 2020. 😻 👍🏿"""
# [Adapted from here.](https://thepopbreak.com/2020/11/22/dolly-partons-christmas-on-the-square-review-not-quite-a-christmas-miracle/)
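With this single call, spaCy runs its processing pipeline on the text and stores the results (tokens, part-of-speech tags, named entities, and so on) in the returned doc object.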
doc = nlp(sample_text)
Let's look at part-of-speech tags.
print([(token, token.pos_) for token in doc][:20])
[(Dolly, 'PROPN'), (Parton, 'PROPN'), (is, 'AUX'), (a, 'DET'), (gift, 'NOUN'), (to, 'ADP'), (us, 'PRON'), (all, 'PRON'), (., 'PUNCT'), ( , 'SPACE'), (From, 'ADP'), (writing, 'VERB'), (all, 'DET'), (-, 'PUNCT'), (time, 'NOUN'), (great, 'ADJ'), (songs, 'NOUN'), (like, 'ADP'), (“, 'PUNCT'), (Jolene, 'PROPN')]
from spacy import displacy
displacy.render(doc, style="ent")
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("\nPERSON means: ", spacy.explain("PERSON"))
print("\nDATE means: ", spacy.explain("DATE"))
Named entities:
[('Dolly Parton', 'PERSON'), ('Jolene', 'PERSON'), ('9 to 5', 'DATE'), ('Netflix', 'ORG'), ('Dolly Parton', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christine Baranski', 'PERSON'), ('Christmas Eve', 'DATE'), ('Debbie Allen', 'PERSON'), ('Jennifer Lewis', 'PERSON'), ('Parton', 'PERSON'), ('Christmas', 'DATE'), ('Square', 'FAC'), ('Christmas 2020', 'DATE')]
ORG means: Companies, agencies, institutions, etc.
PERSON means: People, including fictional
DATE means: Absolute or relative dates or periods
Goal: Extract and visualize inter-corporate relationships from disclosed annual 10-K reports of public companies.
text = (
"Heavy hitters, including Microsoft and Google, "
"are competing for customers in cloud services with the likes of IBM and Salesforce."
)
doc = nlp(text)
displacy.render(doc, style="ent")
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
Named entities:
[('Microsoft', 'ORG'), ('Google', 'ORG'), ('IBM', 'ORG'), ('Salesforce', 'PRODUCT')]
If you want emoji identification support install spacymoji in the course environment.
pip install spacymoji
If Python still complains that the spacymoji module cannot be found after installation, a likely cause is that pip is not installed in your conda environment. Go to your course conda environment, install pip there, and then install the spacymoji package using the pip you just installed in that environment.
conda install pip
YOUR_MINICONDA_PATH/miniconda3/envs/cpsc330/bin/pip install spacymoji
from spacymoji import Emoji
nlp.add_pipe("emoji", first=True);
Does the text have any emojis? If yes, extract the description.
doc = nlp(sample_text)
doc._.emoji
[('😻', 138, 'smiling cat with heart-eyes'),
('👍🏿', 139, 'thumbs up dark skin tone')]
import en_core_web_md
import spacy
nlp = en_core_web_md.load()
from spacymoji import Emoji
nlp.add_pipe("emoji", first=True)
def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
        the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
        the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)
    """
    return len(text) / TWITTER_ALLOWED_CHARS
def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
        the input text

    Returns:
    -------
    length of tokenized text: (int)
    """
    return len(nltk.word_tokenize(text))
def get_sentiment(text):
    """
    Returns the compound score representing the sentiment,
    between -1 (most extreme negative) and +1 (most extreme positive).
    The compound score is a normalized score calculated by summing
    the valence scores of each word in the lexicon.

    Parameters:
    ------
    text: (str)
        the input text

    Returns:
    -------
    sentiment of the text: (float)
    """
    scores = sid.polarity_scores(text)
    return scores["compound"]
def get_avg_word_length(text):
    """
    Returns the average word length of the given text.

    Parameters:
    text -- (str)
    """
    words = text.split()
    if not words:  # avoid division by zero for empty or whitespace-only text
        return 0.0
    return sum(len(word) for word in words) / len(words)
def has_emoji(text):
    """
    Returns 1 if the given text contains at least one emoji, 0 otherwise.

    Parameters:
    text -- (str)
    """
    doc = nlp(text)
    return 1 if doc._.has_emoji else 0
train_df = train_df.assign(n_words=train_df["OriginalTweet"].apply(get_length_in_words))
train_df = train_df.assign(vader_sentiment=train_df["OriginalTweet"].apply(get_sentiment))
train_df = train_df.assign(rel_char_len=train_df["OriginalTweet"].apply(get_relative_length))
test_df = test_df.assign(n_words=test_df["OriginalTweet"].apply(get_length_in_words))
test_df = test_df.assign(vader_sentiment=test_df["OriginalTweet"].apply(get_sentiment))
test_df = test_df.assign(rel_char_len=test_df["OriginalTweet"].apply(get_relative_length))
train_df = train_df.assign(
average_word_length=train_df["OriginalTweet"].apply(get_avg_word_length)
)
test_df = test_df.assign(average_word_length=test_df["OriginalTweet"].apply(get_avg_word_length))
# whether all letters are uppercase or not (all_caps)
train_df = train_df.assign(
all_caps=train_df["OriginalTweet"].apply(lambda x: 1 if x.isupper() else 0)
)
test_df = test_df.assign(
all_caps=test_df["OriginalTweet"].apply(lambda x: 1 if x.isupper() else 0)
)
train_df = train_df.assign(has_emoji=train_df["OriginalTweet"].apply(has_emoji))
test_df = test_df.assign(has_emoji=test_df["OriginalTweet"].apply(has_emoji))
train_df.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | n_words | vader_sentiment | rel_char_len | average_word_length | all_caps | has_emoji |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1927 | 1928 | 46880 | Seattle, WA | 13-03-2020 | While I don't like all of Amazon's choices, to... | Positive | 31 | -0.1053 | 0.589286 | 5.640000 | 0 | 0 |
| 1068 | 1069 | 46021 | NaN | 13-03-2020 | Me: shit buckets, its time to do the weekly s... | Negative | 52 | -0.2500 | 0.932143 | 4.636364 | 0 | 0 |
| 803 | 804 | 45756 | The Outer Limits | 12-03-2020 | @SecPompeo @realDonaldTrump You mean the plan ... | Neutral | 44 | 0.0000 | 0.910714 | 6.741935 | 0 | 0 |
| 2846 | 2847 | 47799 | Flagstaff, AZ | 15-03-2020 | @lauvagrande People who are sick arent panic ... | Extremely Negative | 46 | -0.8481 | 0.907143 | 5.023810 | 0 | 0 |
| 3768 | 3769 | 48721 | Montreal, Canada | 16-03-2020 | Coronavirus Panic: Toilet Paper Is the People... | Negative | 21 | -0.5106 | 0.500000 | 9.846154 | 0 | 0 |
train_df.shape
(3038, 12)
(train_df['all_caps'] == 1).sum()
0
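The length-related features are simple enough to check by hand. The following standalone sketch computes a few of them for one made-up string using only the standard library. The helper name `simple_text_features` is hypothetical, and it counts whitespace-separated words, which differs from the `nltk.word_tokenize` count used for `n_words` above.

```python
def simple_text_features(text, max_chars=280.0):
    """Compute a few hand-crafted text features (illustrative helper)."""
    words = text.split()
    # Guard against empty text to avoid dividing by zero.
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "rel_char_len": len(text) / max_chars,
        "n_whitespace_words": len(words),
        "average_word_length": avg_len,
        # str.isupper() is True only if all cased characters are uppercase
        # and there is at least one cased character.
        "all_caps": int(text.isupper()),
    }


feats = simple_text_features("STAY HOME")
for name, value in feats.items():
    print(f"{name}: {value}")
```

Since no training tweet is entirely uppercase, the `all_caps` column is constant in the training data and cannot help the model on this dataset.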
X_train = train_df.drop(columns=['Sentiment'])
numeric_features = ['vader_sentiment',
'rel_char_len',
'average_word_length']
passthrough_features = ['all_caps', 'has_emoji']
text_feature = 'OriginalTweet'
drop_features = ['UserName', 'ScreenName', 'Location', 'TweetAt']
preprocessor = make_column_transformer(
(StandardScaler(), numeric_features),
("passthrough", passthrough_features),
(CountVectorizer(stop_words='english'), text_feature),
("drop", drop_features)
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
results["LR (more feats)"] = mean_std_cross_val_scores(
pipe, X_train, y_train, return_train_score=True, scoring=scoring_metrics
)
pd.DataFrame(results).T
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| dummy | 0.004 (+/- 0.001) | 0.003 (+/- 0.001) | 0.280 (+/- 0.001) | 0.280 (+/- 0.000) |
| logistic regression | 4.959 (+/- 1.636) | 0.098 (+/- 0.011) | 0.413 (+/- 0.011) | 0.999 (+/- 0.000) |
| LR (more feats) | 5.383 (+/- 0.902) | 0.121 (+/- 0.011) | 0.689 (+/- 0.007) | 0.998 (+/- 0.001) |
pipe.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
ColumnTransformer(transformers=[('standardscaler',
StandardScaler(),
['vader_sentiment',
'rel_char_len',
'average_word_length']),
('passthrough', 'passthrough',
['all_caps', 'has_emoji']),
('countvectorizer',
CountVectorizer(stop_words='english'),
'OriginalTweet'),
('drop', 'drop',
['UserName', 'ScreenName',
'Location', 'TweetAt'])])),
('logisticregression', LogisticRegression(max_iter=1000))])
cv_feats = pipe.named_steps['columntransformer'].named_transformers_['countvectorizer'].get_feature_names_out().tolist()
feat_names = numeric_features + passthrough_features + cv_feats
coefs = pipe.named_steps['logisticregression'].coef_[0]
df = pd.DataFrame(
data={
"features": feat_names,
"coefficients": coefs,
}
)
df.sort_values('coefficients')
| | features | coefficients |
|---|---|---|
| 0 | vader_sentiment | -6.141919 |
| 11331 | won | -1.369740 |
| 2551 | coronapocalypse | -0.809931 |
| 2214 | closed | -0.744717 |
| 8661 | retail | -0.723808 |
| ... | ... | ... |
| 9862 | stupid | 1.157669 |
| 3299 | don | 1.159067 |
| 4879 | hell | 1.311957 |
| 3129 | die | 1.366538 |
| 7504 | panic | 1.527156 |
11664 rows × 2 columns
We get some improvements with our engineered features!
spaCy.
The algorithms we used are very standard for Kagglers ... We spent most of our efforts in feature engineering...
- Xavier Conort, on winning the Flight Quest challenge on Kaggle

Find the features (columns) of $X$ that are important for predicting $y$, and remove the features that aren't.
Given a feature matrix $X = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}$ with columns $x_1, \dots, x_n$ and a target vector $y$, find the columns $1 \leq j \leq n$ in $X$ that are important for predicting $y$.
Feature selection can often result in better-performing (less overfit), easier-to-understand, and faster models.
sklearn:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0, test_size=0.5
)
X_train.shape
(284, 30)
pipe_lr_all_feats = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_lr_all_feats.fit(X_train, y_train)
pd.DataFrame(
cross_validate(pipe_lr_all_feats, X_train, y_train, return_train_score=True)
).mean()
fit_time       0.011689
score_time     0.001489
test_score     0.968233
train_score    0.987681
dtype: float64
SelectFromModel transformer.
RandomForestClassifier for feature selection with threshold "median" of feature importances.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
select_rf = SelectFromModel(
RandomForestClassifier(n_estimators=100, random_state=42),
threshold="median"
)
Can we use KNN to select features?
from sklearn.neighbors import KNeighborsClassifier
select_knn = SelectFromModel(
KNeighborsClassifier(),
threshold="median"
)
pipe_lr_model_based = make_pipeline(
StandardScaler(), select_knn, LogisticRegression(max_iter=1000)
)
# pd.DataFrame(
#     cross_validate(pipe_lr_model_based, X_train, y_train, return_train_score=True)
# ).mean()
No, KNN won't work: it does not report feature importances (it has neither coef_ nor feature_importances_).
What about SVC?
select_svc = SelectFromModel(
SVC(), threshold="median"
)
# pipe_lr_model_based = make_pipeline(
# StandardScaler(), select_svc, LogisticRegression(max_iter=1000)
# )
# pd.DataFrame(
# cross_validate(pipe_lr_model_based, X_train, y_train, return_train_score=True)
# ).mean()
Only with a linear kernel, which exposes coef_; it does not work with the RBF kernel.
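The two answers above come down to whether a fitted estimator exposes `coef_` or `feature_importances_`. A small sketch that checks this directly, assuming the breast cancer data and default hyperparameters (the dictionary keys are just labels chosen for this example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

candidates = {
    "knn": KNeighborsClassifier(),
    "svc_rbf": SVC(kernel="rbf"),
    "svc_linear": SVC(kernel="linear"),
}

usable = {}
for name, estimator in candidates.items():
    estimator.fit(X, y)
    # Model-based selection needs one of these two attributes after fitting.
    usable[name] = hasattr(estimator, "coef_") or hasattr(
        estimator, "feature_importances_"
    )
    print(f"{name}: usable for model-based selection = {usable[name]}")
```

Only the linear-kernel SVC passes the check, which matches the answers given above.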
We can put the feature selection transformer in a pipeline.
pipe_lr_model_based = make_pipeline(
StandardScaler(), select_rf, LogisticRegression(max_iter=1000)
)
pd.DataFrame(
cross_validate(pipe_lr_model_based, X_train, y_train, return_train_score=True)
).mean()
fit_time       0.231781
score_time     0.020413
test_score     0.950564
train_score    0.974480
dtype: float64
pipe_lr_model_based.fit(X_train, y_train)
pipe_lr_model_based.named_steps["selectfrommodel"].transform(X_train).shape
(284, 15)
Similar results with only 15 features instead of 30 features.
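To see which 15 features survive, the fitted selector can be inspected. A minimal sketch, assuming the selector is fit on the full breast cancer data rather than inside the pipeline above, so the exact features kept may differ from the cross-validated run:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
selector.fit(data.data, data.target)

# Importances come from the fitted forest; the mask marks the kept columns.
importances = selector.estimator_.feature_importances_
mask = selector.get_support()

print(f"threshold (median importance): {selector.threshold_:.4f}")
print(f"kept {mask.sum()} of {mask.size} features")
print(data.feature_names[mask])
```

With `threshold="median"`, every feature whose importance is at least the median is kept, which is why roughly half of the 30 features remain.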
coef_ or feature_importances_.
Note that this is not the same as just removing all the less important features in one shot!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
from sklearn.feature_selection import RFE
# create ranking of features
rfe = RFE(LogisticRegression(), n_features_to_select=5)
rfe.fit(X_train_scaled, y_train)
rfe.ranking_
array([16, 12, 19, 13, 23, 20, 10, 1, 9, 22, 2, 25, 5, 7, 15, 4, 26,
18, 21, 8, 1, 1, 1, 6, 14, 24, 3, 1, 17, 11])
print(rfe.support_)
[False False False False False False False True False False False False False False False False False False False False True True True False False False False True False False]
print("selected features: ", cancer.feature_names[rfe.support_])
selected features: ['mean concave points' 'worst radius' 'worst texture' 'worst perimeter' 'worst concave points']
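The difference between recursive elimination and a one-shot cut can be made concrete with a hand-written loop. This is a simplified sketch of the idea, not sklearn's implementation: it assumes logistic regression, ranks features by absolute coefficient, removes one feature per round, and uses the full scaled dataset instead of the training split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
n_keep = 5

# One shot: fit once on all features and keep the n_keep largest |coef|.
full_model = LogisticRegression(max_iter=1000).fit(X, y)
order = np.argsort(np.abs(full_model.coef_[0]))
one_shot = {int(i) for i in order[-n_keep:]}

# Recursive: refit after every removal, so the importances are recomputed
# in the context of the features that are still present.
remaining = list(range(X.shape[1]))
while len(remaining) > n_keep:
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    del remaining[weakest]
recursive = set(remaining)

print("one-shot :", sorted(one_shot))
print("recursive:", sorted(recursive))
print("in both  :", sorted(one_shot & recursive))
```

Because coefficients change once correlated features are dropped, the two index sets need not agree.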
n_features_to_select?
RFECV which uses cross-validation to select the number of features.
from sklearn.feature_selection import RFECV
rfe_cv = RFECV(LogisticRegression(max_iter=2000), cv=10)
rfe_cv.fit(X_train_scaled, y_train)
print(rfe_cv.support_)
print(cancer.feature_names[rfe_cv.support_])
[False True False True False False True True True False True False True True False True False False False True True True True True False False True True False True]
['mean texture' 'mean area' 'mean concavity' 'mean concave points' 'mean symmetry' 'radius error' 'perimeter error' 'area error' 'compactness error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst concavity' 'worst concave points' 'worst fractal dimension']
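Besides `support_`, a fitted `RFECV` records how many features it settled on and a full ranking. A short sketch, assuming the same split as above but `cv=5` to keep it quick, so the number selected may not match the 10-fold result shown:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    cancer.data, cancer.target, random_state=0, test_size=0.5
)
X_tr_scaled = StandardScaler().fit_transform(X_tr)

selector = RFECV(LogisticRegression(max_iter=2000), cv=5)
selector.fit(X_tr_scaled, y_tr)

# n_features_ is the number of features chosen by cross-validation;
# selected features have rank 1, eliminated ones have larger ranks.
print("number of features selected:", selector.n_features_)
print("selected:", cancer.feature_names[selector.support_])
print("ranking :", selector.ranking_)
```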
rfe_pipe = make_pipeline(
StandardScaler(),
RFECV(LogisticRegression(max_iter=2000), cv=10),
RandomForestClassifier(n_estimators=100, random_state=42),
)
pd.DataFrame(cross_validate(rfe_pipe, X_train, y_train, return_train_score=True)).mean()
fit_time       1.735235
score_time     0.012351
test_score     0.943609
train_score    1.000000
dtype: float64

# from sklearn.feature_selection import SequentialFeatureSelector
# pipe_forward = make_pipeline(
# StandardScaler(),
# SequentialFeatureSelector(LogisticRegression(max_iter=1000),
# direction="forward",
# n_features_to_select='auto',
# tol=None),
# RandomForestClassifier(n_estimators=100, random_state=42),
# )
# pd.DataFrame(
# cross_validate(pipe_forward, X_train, y_train, return_train_score=True)
# ).mean()
# pipe_backward = make_pipeline(
#     StandardScaler(),
#     SequentialFeatureSelector(
#         LogisticRegression(max_iter=1000),
#         direction="backward",
#         n_features_to_select=15),
#     RandomForestClassifier(n_estimators=100, random_state=42),
# )
# pd.DataFrame(
#     cross_validate(pipe_backward, X_train, y_train, return_train_score=True)
# ).mean()
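The cells above are commented out, presumably because sequential selection is slow: each step refits the model once per candidate feature and per fold. A scaled-down sketch that runs quickly, assuming forward selection of only 3 features with 3-fold cross-validation (both numbers chosen here for speed, not taken from the lecture), and scaling the whole dataset up front for brevity:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
y = cancer.target

# Forward selection: start from no features and repeatedly add the single
# feature that gives the best cross-validated score.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    direction="forward",
    n_features_to_select=3,
    cv=3,
)
sfs.fit(X, y)

selected = cancer.feature_names[sfs.get_support()]
print("forward-selected features:", selected)
```

Unlike `RFE`, this search scores candidate feature sets by cross-validation and never looks at `coef_` or `feature_importances_`, so it also works with estimators such as KNN.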
Select all of the following statements which are TRUE.
rfe.ranking_ is the same as the order of original feature importances given by the model.
What if you are also given a "baby" feature?
Now the sex feature becomes relevant.
General problem (context specific relevance)
A feature is only relevant in the context of other features.
Confounding factors can make irrelevant features the most relevant.
If features can be predicted from other features, you cannot know which one to pick.
Feature relevance does not imply a causal relationship.
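The point that a feature is only relevant in the context of other features can be shown with a synthetic XOR-style dataset. This is an illustrative construction, not data from the lecture: the target is the exclusive-or of two binary features, and accuracy is measured on the training data itself to keep the sketch deterministic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Every combination of two binary features, repeated 25 times.
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
X = np.tile(base, (25, 1))
y = X[:, 0] ^ X[:, 1]  # the target depends on both features jointly


def train_accuracy(columns):
    """Fit 1-nearest-neighbour on the given columns and score on the same data."""
    model = KNeighborsClassifier(n_neighbors=1).fit(X[:, columns], y)
    return model.score(X[:, columns], y)


acc_first = train_accuracy([0])
acc_second = train_accuracy([1])
acc_both = train_accuracy([0, 1])

print("first feature only :", acc_first)
print("second feature only:", acc_second)
print("both features      :", acc_both)
```

Each feature alone carries no information about the target, so any selection method that scores features one at a time would discard both, even though together they determine the target exactly.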
Is feature selection completely hopeless?
RFE, simulated annealing, genetic algorithms)